Accepting Failure: Availability through Repair-centric System Design
Abstract
Motivated by the lack of rapid improvement in the availability of Internet server systems, we introduce a new philosophy for designing highly-available systems that better reflects the realities of the Internet service environment. Our approach, denoted repair-centric system design, is based on the belief that failures are inevitable in complex, human-administered systems, and thus we focus on detecting and repairing failures quickly and effectively, rather than just trying to avoid them. Our proposal is unique in that it tackles the human aspects of availability along with the traditional system aspects. We enumerate a set of design techniques for building repair-centric systems, outline a plan for implementing these techniques in an existing cluster email service application, and describe how we intend to quantitatively evaluate the availability gains achieved by repair-centric design.
Similar resources
Reliability Analysis of Redundant Repairable System with Degraded Failure
This investigation deals with the transient analysis of a machine repair system consisting of M operating units under the care of a single repairman. To improve the system reliability/availability, Y warm standby and S cold standby units are provided to replace the failed units. When all spares are in use, units fail in a degraded fashion. In such situation ...
Availability of k-out-of-n: F Secondary Subsystem with General Repair Time Distribution
In this paper we study the steady state availability of main k-out-of-n: F and secondary subsystems having general repair time distribution. When more than k units of main subsystem fail, then the main subsystem shuts off the secondary subsystem. The life time distributions of the main units and that of secondary subsystem are exponentially distributed. A repair facility having single repairman...
Embracing Failure: A Case for Repair-centric System Design
Motivated by the lack of availability demonstrated by current approaches to building servers for the Internet environment, we argue for a new approach to building highly-available systems that better reflects the realities of the modern server environment, namely that failures of hardware, software, and humans are inevitable. Our approach, denoted repair-centric design, recognizes the inevitabi...
Fuzzy Reliability Evaluation of a Repairable System with Imperfect Coverage, Reboot and Common-cause Shock Failure
In the present investigation, we deal with the reliability characteristics of a repairable system consisting of two independent operating units, by incorporating the coverage factor. The probability of the successful detection, location and recovery from a failure is known as the coverage probability. The reboot delay and common cause shock failure are also considered. The times to failure of t...
Performance Modeling of Power Generation System of a Thermal Plant
The present paper discusses the development of a performance model of the power generation system of a thermal plant for performance evaluation using the Markov technique and a probabilistic approach. The study covers two areas: development of a predictive model and evaluation of performance with the help of the developed model. The present system of thermal plant under study consists of four subsystems with...